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A parametric method similar to autoregressive spectral 
estimators is proposed to determine the probability density 
function (pdf) of a random set. The method proceeds by 
maximizing the likelihood of the pdf, yielding estimates that 
perform equally well in the tails as in the bulk of the distri- 
bution. It is therefore well suited for the analysis short sets 
drawn from smooth pdfs and stands out by the simplicity of 
its computational scheme. Its advantages and limitations are 
discussed. 

PACS numbers : 

02.50.Ng Distribution theory and Monte Carlo studies 
02.70.Hm Spectral methods 



I. INTRODUCTION 

There are many applications in which it is necessary 
to estimate the probability density function (pdf) from 
a finite sample of n observations {x\, X2, ■ ■ ■ , x n } whose 
true pdf is f(x). Here we consider the generic case in 
which the identically distributed (but not necessarily in- 
dependent) random variables have a compact support 
Xk 6 [a, b]. 

The usual starting point for a pdf estimation is the 
naive estimate 

1 n 

fs(x) = - Y]6(x-Xi) , (1) 
i=i 

where <5(.) stands for the Dirac delta function. Although 
this definition has a number of advantages, it is useless for 
practical purposes since a smooth functional is needed. 

Our problem consists in finding an estimate f(x) whose 
integral over an interval of given length converges toward 
that of the true pdf as n — ► oo. Many solutions have been 
developed for that purpose: foremost among these are 
kernel techniques in which the estimate fs(x) is smoothed 
locally using a kernel function K{x) Jl]-§] 

fo)= f -k(—) fs(v)4y, (2) 

whose width is controlled by the parameter w. The well- 
known histogram is a variant of this technique. Although 
kernel approaches are by far the most popular ones, the 



choice of a suitable width remains a basic problem for 
which visual guidance is often needed. More generally, 
one faces the problem of choosing a good partition. Some 
solutions include Bayesian approaches Q , polynomial fits 
H and methods based on wavelet filtering || . 

An alternative approach, considered by many authors 
0-0, is a projection of the pdf on orthogonal functions 

k 

where the partition problem is now treated in dual space. 
This parametric approach has a number of interesting 
properties: a finite expansion often suffices to obtain a 
good approximation of the pdf and the convergence of the 
series versus the sample size n is generally faster than for 
kernel estimates. A strong point is its global character, 
since the pdf is fitted globally, yielding estimates that 
are better behaved in regions where the lack of statistics 
causes kernel estimates to perform poorly. Such a prop- 
erty is particularly relevant for the analysis of turbulent 
wavefields, in which the tails of the distribution are of 
great interest (e.g. fl^l ). 

These advantages, however, should be weighed against 
a number of downsides. Orthogonal series do not pro- 
vide consistent estimates of the pdf since for increasing 
number of terms they converge toward fs(x) instead of 
the true density f(x) (T^|. Furthermore, most series can 
only handle continuous or pieccwisc continuous densities. 
Finally, the pdf estimates obtained that way are not guar- 
anteed to be nonnegative (see for example the problems 
encountered in 0). 

The first problem is not a major obstacle, since most 
experimental distributions are smooth anyway. The sec- 
ond one is more problematic. In this paper we show how 
it can be partly overcome by using a Fourier series ex- 
pansion of the pdf and seeking a maximization of the 
likelihood 

L= f \ogf{x)dx . (4) 

J a 

The problem of choosing an appropriate partition then 
reduces to that of fitting the pdf with a positive definite 
Pade approximant 

Our motivation for presenting this particular paramet- 
ric approach stems from its robustness, its simplicity and 
the originality of the computational scheme it leads to. 
The latter, as will be shown later, is closely related to the 
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problem of estimating power spectral densities with au- 
toregressive (AR) or maximum entropy methods jl6] |l8| . 
To the best of our knowledge, the only earlier reference to 
similar work is that by Carmichael |19| ; here we empha- 
size the relevance of the method for estimating pdfs and 
propose a criterion for choosing the optimum number of 
basis functions. 



II. THE MAXIMUM LIKELIHOOD APPROACH 

The method we now describe basically involves a pro- 
jection of the pdf on a Fourier series. The correspondence 
between the continuous pdf f(x) and its discrete charac- 
teristic function <p k is established by |2C|j 



f(x) e jkx dx 



+ OC 



f{x) = 27T 



-jkx 



(5) 
(6) 



k= — oo 



where <fi k = 4>*_ k 6 C is hermitian Note that we have 
applied a linear transformation to convert the support 
from [a, b] to [— 7T,7r]. 

For a finite sample, an unbiased estimate of the charac- 
teristic function is obtained by inserting eq. [l] into eq. ||, 
giving 



1 " 



Jkxi 



(7) 



The main problem now consists in recovering the pdf 
from eq. ^ while avoiding the infinite summation. By 
working in dual space we have substituted the partition 
choice problem by that of selecting the number of relevant 
terms in the Fourier series expansion. 

The simplest choice would be to truncate the series at 
a given "wave number" p and discard the other ones 



+p 

f{x) = 2tt £ «£ fc e ~ ik * 

k=—p 



(8) 



Such a truncation is equivalent to keeping the lowest wave 
numbers and thus filtering out small details of the pdf. 
Incidentally, this solution is equivalent to a kernel filter- 
ing with K(x) — sm{'Kx)/'KX as kernel. This kernel is 
usually avoided because it suffers from many drawbacks 
such as the generation of spurious oscillations. 

An interesting improvement was suggested by Burg in 
the context of spectral density estimation (see for exam- 
ple Jll|[l7j ) • The heuristic idea is to keep some of the low 
wave number terms while the remaining ones, instead of 
being set to zero, are left as free parameters: 



+ OG 



-jxk 



f(x) = 2tt ^2 a k e 

k— — QG 

with a k = cj) k , \k\ <p . 



(9) 



The parameters a k , for |fc| > p, are then fixed self- 
consistently according to some criterion. 

We make use of this freedom to constrain the solution 
to a particular class of estimates. Without any prior in- 
formation at hand, a reasonable choice is to select the 
estimate that contains the least possible information or 
is the most likely. It is therefore natural to seek a maxi- 
mization of an entropic quantity such as the sample en- 
tropy 



H = - f(x)logf(x)dx , 

J —IT 

or the sample likelihood 



L = 



log fix) dx 



(10) 



(11) 



We are a priori inclined to choose the entropy because 
our objective is the estimation of the pdf and not that 
of the characteristic function. However, numerical inves- 
tigations done in the context of spectral density estima- 
tion rather lend support to the likelihood criterion p2[ . 
A different and stronger motivation for preferring a max- 
imization of the likelihood comes from the simplicity of 
the computational scheme it gives rise to. 

This maximization means that the tail of the charac- 
teristic function is chosen subject to the constraint 



dL 

da k 



0, \k\>p 



(12) 



From eqs. ^ and [IT] the likelihood can be rewritten as 



L = 



/ + 7T / +00 

log 27T £ 

- 71 \ fc=-oc 



a k e 



- jxk dx 



(13) 



As shown in the appendix, the likelihood is maximized 
when the pdf can be expressed by the functional 



f P (x) 



1 



(14) 



which is a particular case of a Pade approximant with 
poles only and no zeros [ft5|. Requiring that f P ix) is real 
and bounded, it can be rewritten as 



fp(x) = 7T- 



1 



27r |l + aie-J* H h a p e-iv x Y 



(15) 



The values of the coefficients {ai, . . . , a p } and of the nor- 
malization constant £o ar e set by the condition that the 
Fourier transform of f p {x) must match the sample char- 
acteristic function cj) k for \k\ < p. 

This solution has a number of remarkable properties, 
some of which are deferred to the appendix. Foremost 
among these are its positive definite character and the 
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simple relationship which links the polynomial coeffi- 
cients {ai, . . . , a p } to the characteristic function on which 
they perform a regression. Indeed, we have 



f>k + ai<f>k-i + a 2 <\>k-2 H 1- a p (f>k-p = 0, 

1 < k < p . 



(16) 



This can be cast in a set of Yule- Walker equations whose 
unique solution contains the polynomial coefficients 



-p+i 

-p+2 



Dp-2 





01 












02 








. 4>p . 



(17) 



Advantage can be taken here of the Toeplitz structure of 
the matrix. The proper normalization ( J_ "* f(x) Ax = 1) 
of the pdf is ensured by the value of Eq, which is given 
by a variant of eq. [U» 



b + ai0_i + a 2 (f>-2 H h <%>0-p = £o • 



(18) 



Equations |l5] and [l?] illustrate the simplicity of the 
method. 



III. SOME PROPERTIES 

A clear advantage of the method over conventional se- 
ries expansions is the automatic positive definite charac- 
ter of the pdf. Another asset is the close resemblance 
with autoregressive or maximum entropy methods that 
are nowadays widely used in the estimation of spectral 
densities. Both methods have in common the estima- 
tion of a positive function by means of a Pade approx- 
imant whose coefficients directly issue from a regression 
(eq. |l6|). This analogy allows us to exploit here some 
results previously obtained in the framework of spectral 
analysis. 

One of these concerns the statistical properties of 
the maximum likelihood estimate. These properties are 
badly known because the nonlinearity of the problem im- 
pedes any analytical treatment. The analogy with spec- 
tral densities, however, reveals that the estimates are 
asymptotically normally distributed with a standard de- 
viation MM 



cj I cx / 



(19) 



This scaling should be compared against that of conven- 
tional kernel estimates, for which 



f ■ 



(20) 



The key point is that kernel estimates are relatively less 
reliable in low density regions than in the bulk of the dis- 
tribution, whereas the relative uncertainty of maximum 



likelihood estimates is essentially constant. The latter 
property is obviously preferable when the tails of the dis- 
tribution must be investigated, e.g. in the study of rare 
events. 

Some comments are now in order. By choosing a 
Fourier series expansion, we have implicitly assumed that 
the pdf was 27r-periodic, which is not necessarily the case. 
Thus special care is needed to enforce periodicity, since 
otherwise wraparound may result p5[ . The solution to 
this problem depends on how easily the pdf can be ex- 
tended periodically. In most applications, the tails of the 
distribution progressively decrease to zero, so periodicity 
may be enforced simply by artificially padding the tails 
with a small interval in which the density vanishes. We 
do this by rescaling the support from [a, b] to an interval 
which is slightly smaller than 2ir, say [—3,3] p^j . Once 
the Pade approximant is known, the [—3, 3] interval is 
scaled back to [a, b}. 

If there is no natural periodic extension to the pdf, 
(for example if f(a) strongly differs from /(£>)) then the 
choice of Fourier basis functions in eq. [| becomes ques- 
tionable and, not surprisingly, the quality of the fit de- 
grades. Even in this case, however, the results can still 
be improved by using ad hoc solutions [ p7| . 

We mentioned before that the maximum likelihood 
method stands out by computational simplicity. Indeed, 
a minimization of the entropy would lead to the solution 



p 

log f p (x) cx c ke~ jkx 

k=—p 



(21) 



whose numerical implementation requires an iterative 
minimization and is therefore considerably more demand- 
ing. 

Finally, the computational cost is found to be compa- 
rable or even better (for large sets) than for conventional 
histogram estimates. Most of the computation time goes 
into the calculation of the characteristic function, for 
which the number of operations scales as the sample size 
n. 



IV. CHOOSING THE ORDER OF THE MODEL 

The larger the order p of the model is, the finer the 
details in the pdf estimate are. Finite sample effects, 
however, also increase with p. It is therefore of prime 
importance to find a compromise. Conventional criteria 
for selecting the best compromise between model com- 
plexity and quality of the fit, such as the Final Predic- 
tion Error and the Minimum Description Length |l6|-|T§|] 
are not applicable here because they require the series of 
characteristic functions {<pk\ to be normally distributed, 
which they are not. 

Guided by the way these empirical criteria have been 
chosen, we have defined a new one, which is based on 
the following observation: as p increases starting from 0, 
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the pdfs f p (x) progressively converge toward a station- 
ary shape; after some optimal order, however, ripples ap- 
pear and the shapes start diverging again. It is therefore 
reasonable to compare the pdfs pairwise and determine 
how close they are. A natural measure of closeness be- 
tween two positive distributions f p (x) and f v + \(x) is the 
Kullback-Lcibler entropy or information gain pH,E9| 



+ " fp+iW log%^dx 

-T Jp\ X ) 



(22) 



which quantifies the amount of information gained by 
changing the probability density describing our sample 
from f p to fp+\- In other words, if H p (or -ffp+i) is the 
hypothesis that x was selected from the population whose 
probability density is f p (f p +i), then I(f p+1 ,f p ) is given 
as the mean information for discriminating between H p +i 
and Hp per observation from f p +i [p8[ . 

Notice that the information gain is not a distance be- 
tween distributions; it nevertheless has the property of 
being non negative and to vanish if and only if f p = f p +i- 
We now proceed as follows : starting from p = the order 
is incremented until the information gain reaches a clear 
minimum; this corresponds, as it has been checked nu- 
merically, to the convergence toward a stationary shape; 
the corresponding order is then taken as the requested 
compromise. Clearly, there is some arbitrariness in the 
definition of a such a minimum since visual inspection 
and common sense are needed. In most cases, however, 
the solution is evident and the search can be automated. 
Optimal orders usually range between 2 and 10; larger 
values may be needed to model discontinuous or com- 
plex shaped densities. 



V. SOME EXAMPLES 

Three examples are now given in order to illustrate the 
limits and the advantages of the method. 



A. General properties 

First, we consider a normal distribution with exponen- 
tial tails as often encountered in turbulent wavefields. We 
simulated a random sample with n = 2000 elements and 
the main results appear in Fig. El 

The information gain (Fig. |l]b) decreases as expected 
until it reaches a well defined minimum at p = 7, which 
therefore sets the optimal order of our model. Since the 
true pdf is known, we can test this result against a com- 
mon measure of the quality of the fit, which is the Mean 
Integrated Squared Error (MISE) 



The MISE, which is displayed in Fig. [l]b, also reaches 
a minimum at p — 8 and thus supports the choice of 
the information gain as a reliable indicator for the best 
model. Tests carried out on other types of distributions 
confirm this good agreement. 

Now that the optimum pdf has been found, its charac- 
teristic function can be computed and compared with the 
measured one, see Fig. EJa. As expected, the two charac- 
teristic functions coincide for the p lowest wave numbers 
(cq. 16); they diverge at higher wave numbers, for which 
the model tries to extrapolate the characteristic function 
sclf-consistently. The fast falloff of the maximum like- 
lihood estimate explains the relatively smooth shape of 
the resulting pdf. 

Finally, the quality of the pdf can be visualized in 
Fig. ^d, which compares the measured pdf with the true 
one, and an estimate based on a histogram with 101 bins. 
An excellent agreement is obtained, both in the bulk of 
the distribution and in the tails, where the exponential 
falloff is correctly reproduced. This example illustrates 
the ability of the method to get reliable estimates in re- 
gions where standard histogram approaches have a lower 
performance. 



B. Interpreting the characteristic function 

The shape of the characteristic function in Fig. [l]a is 
reminiscent of spectral densities consisting of a low wave 
number (band-limited) component embedded in broad- 
band noise. A straightforward calculation of the expec- 
tation of \<f>k | indeed reveals the presence of a bias which 
is due to the finite sample size 



E[ 



b k \] = + 

Jn 



(24) 



MISE(p) = / [f(x) - f p (x)] 2 dx 



(23) 



where 7 depends on the degree of independence between 
the samples in {x}. This bias is illustrated in Fig. || 
for independent variables drawn from a normal distribu- 
tion, showing how the wave number resolution gradually 
degrades as the sample size decreases. Incidentally, a 
knowledge of the bias level could be used to obtain con- 
fidence intervals for the pdf estimate. This would be 
interesting insofar no assumptions have to be made on 
possible correlations in the data set. We found this ap- 
proach, however, to be too inaccurate on average to be 
useful. 

The presence of a bias also gives an indication of the 
smallest scales (in terms of amplitude of x) one can re- 
liably distinguish in the pdf. For a set of 2000 samples 
drawn from a normal distribution, for example, compo- 
nents with wave numbers in excess of k = 3 are hidden 
by noise and hence the smallest meaningful scales in the 
pdf are of the order of 8x — 0.33. These results could 
possibly be further improved by Wiener filtering. 
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C. Influence of the sample size 

To investigate the effect of the sample length n, we 
now consider a bimodal distribution consisting of two 
normal distributions with different means and standard 
deviations. Such distributions are known to be difficult 
to handle with kernel estimators. 

Samples with respectively n = 200, n = 2000 and 
n = 20000 elements were generated; their characteristic 
functions and the resulting pdfs are displayed in Fig. 
Clearly, finite sample effects cannot be avoided for small 
samples but the method nevertheless succeeds relatively 
well in capturing the true pdf and in particular the small 
peak associated with the narrow distribution. An anal- 
ysis of the MISE shows that it is systematically lower 
for maximum likelihood estimates than for standard his- 
togram estimates, supporting the former. 



D. A counterexample 

The previous examples gave relatively good results be- 
cause the true distributions were rather smooth. Al- 
though such smooth distributions are generic in most 
applications it may be instructive to look at a counterex- 
ample, in which the method fails. 

Consider the distribution which corresponds to a cut 
through an annulus 



| 1 < \x\ <2 
elsewhere 



(25) 



A sample was generated with n = 2000 elements and the 
resulting information gains are shown in Fig. [|. There 
is an ambiguity in the choice of the model order and 
indeed the convergence of the pdf estimates toward the 
true pdf is neither uniform nor in the mean. Increasing 
the order improves the fit of the discontinuity a little 
but also increases the oscillatory behavior known as the 
Gibbs phenomenon. This problem is related to the fact 
that the pdf is discontinuous and hence the characteristic 
function is not absolutely summable. 

Similar problems are routinely encountered in the de- 
sign of digital filters, where steep responses cannot be 
approximated with infinite impulse response filters that 
have a limited number of poles p0[ . The bad perfor- 
mance of the maximum likelihood approach in this case 
also comes from its inability to handle densities that van- 
ish over finite intervals. A minimization of the entropy 
would be more appropriate here. 



hood of the pdf subject to the constraint that the char- 
acteristic functions of the sample and estimated pdfs co- 
incide for a given number of terms. Such a global ap- 
proach to the estimation of pdfs is in contrast to the bet- 
ter known local methods (such as non-parametric kernel 
methods) whose performance is poorer in regions where 
there is a lack of statistics, such as the tails of the distri- 
bution. This difference makes the maximum likelihood 
method relevant for the analysis of short records (with 
typically hundreds or thousands of samples). Other ad- 
vantages include a simple computational procedure that 
can be tuned with a single parameter. An entropy-based 
criterion has been developed for selecting the latter. 

The method works best with densities that are at least 
once continuously differentiable and that can be extended 
periodically. Indeed, the shortcomings of the method are 
essentially the same as for autoregressive spectral esti- 
mates, which give rise to the Gibbs phenomenon if the 
density is discontinuous. 

The method can be extended to multivariate densities, 
but the computational procedures are not yet within the 
realm of practical usage. Its numerous analogies with 
the design of digital filters suggest that it is still open to 
improvements. 
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APPENDIX: 

We detail here the main stages that lead to the pdf 
estimate described in Sec. [n| because extensive proofs are 
rather difficult to find in the literature. 

The maximum likelihood condition (eq. |l^) can be ex- 
pressed as 



+ 7T 



-jkx 



OO a _ A J ™ 



dx = 



e J 



fix) 



dx = , (Al) 



for \k\ > p (30[. This simply means that the Fourier 
expansion of (f(x)\ should not contain terms of order 
I As I > p and hence the solution must be 



VI. CONCLUSION 

We have presented a parametric procedure for esti- 
mating univariate densities using a positive definite func- 
tional. The method proceeds by maximizing the likeli- 



f P (x) 



1 



-jkx 



(A2) 



The pdf we are looking for must of course be real, and so 
the coefficients should be hermitian Ck — c*_ k . We also 
want the pdf to be bounded, which implies that 
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p 

E 

k=—p 



C k e- jkX ^ , Vie [-7T, tt] 



(A3) 



Let us now define, for z complex 
C(z) = 



E 

k——p 



C-kZ 



and 



P(z) = z p C(z) 



(A4) 



(A5) 



P(z) is a polynomial of degree 2p. It can be easily verified 
that |3Ul 



P(z) 



„2p 



P 



1 



(A6) 



as a consequence of the hermiticity of the coefficients 
In particular, this tells us that if z\ is a root of P(z), then 
1/z* (the complex-conjugate of its mirror image with re- 
spectto the unit circle) is also a root of P(z). From 



eq. A3 we know that none of these roots are located on 
the unit circle. 

Let us now rearrange the roots of P(z), denoting by 
{zi, . . . , z p } the p roots lying outside the unit disk and by 
{1/z*, . . . , 1/z*} the p other ones that are located inside 
the unit circle. We can then write: 



P(z) = c- p (z - zi) ■ ■■ (z - z p ) ( z 
with 

* * 

C—pZ\ • ■ ■ Zp — CpZ-y ' ' ' Zp 

From this C{z) can be written as: 

1 



1 



C(z) = ±B(z) 



z* 



where 



B(z) 



z\---z v 



(Z - Zl) ■ ■ ■ (Z - Zp) 



1 

z* 
p 

(A7) 
(A8) 

(A9) 
(A10) 



By construction, all the roots of B(z) are located outside 
the unit disk. 

Finally, we get for f P {x): 



fp(%) 



1 



C(z = eJ x ) 



±- 



1 



\B{ef*)\' 



(All) 



All the solutions of the maximum likelihood principle, if 
real and bounded, are thus of constant sign and have the 
structure given by eq. All. Excluding negative definite 
solutions we obtain 



£0 



fp( x ) = 



where 



£0 



2tt |l 



2?r 



1 

aie-i x H V a v e~^\ 



2 • 



bo 



b* 



1. 



(A12) 



(A13) 



where {61, . . . , b p } are the coefficients of the polynomial 

B(z) and A{z) = 1 + a\z H + a p z p has all its roots 

outside the unit disk. The normalization constant £0 is 
set by the condition 



f p (x)dx = 1 . 



(A14) 



The coefficients {ai, . . . ,a p } are now identified on the 
basis that the characteristic function of the pdf estimate 
f p (x) should match the first p terms of the sample char- 
acteristic function exactly, namely 



f p (x) e jkx Ax , 1 < k < p . (A15) 



To this purpose, let us compute the quantity 
zCfe=o a kO i i-k- Recalling that A(z) is analytic in the unit 
circle and making use of Cauchy's residue theorem, we 
obtain 



ELo a k<l>i-k = . 

Efc=0 a k<f>-k = £0 



1 < I <P 



Equation A16| fixes the values of {a 1; 
the Yule- Walker equations (eq. O). 
unique provided that 



(A16) 
(A17) 

. . , a p } and gives 
The solution is 



det 



-P+i 



?0. 



(A18) 



The latter condition is verified except when a repetitive 
pattern occurs in the characteristic function. In this hap- 
pens then the order p should simply be chosen to be less 
than the periodicity of this pattern. 

Besides its positivity, the solution we obtain has a num- 
ber of useful properties. First, note that all the terms of 
its characteristic function can be computed recursively 
by 











&k 




_ Ctk-p+2 _ 





1 
1 



-a p 











OLk 








_ cife-p+i _ 



(A19) 

in which the starting condition is set by the p first values 
of <pk. From this recurrence relation the asymptotic be- 
havior of 4>k as k — > 00 can be probed by diagonalizing 



G 



the state space matrix in eq. 



this matrix are the roots {1/ z\, 
which by construction are all inside the unit disk 
fore 



A19. The eigenvalues of 
l/z*} (called poles), 
There- 



lim 



(A20) 



where A is related to the largest root and is always neg- 
ative since 



max log 

k 



1 



< . 



(A21) 



This exponential falloff of the characteristic function ex- 
plains why the resulting pdf is relatively smooth. 

Now that we have found a solution in terms of a [0,p] 
Pade approximant, it is legitimate to ask whether a [q,p] 
approximant of the type 



fq,p(x) 



die' 



\l + aie-i x -\ \-a p e-iP x \ 



(A22) 



could not bring additional flexibility and hence provide a 
better estimate of the pdf. Again, we exploit the analogy 
with spectral density estimation, in which the equivalent 
of [q,p] Pade approximants are obtained with autoregres- 
sive moving average (ARMA) models. The superiority of 
ARMA over AR models is generally agreed upon |32| , al- 
though the MISE does not firmly establish it JlTj . Mean- 
while we note that there does not seem to exist a sim- 
ple variational principle, similar to that of the likelihood 
maximization, which naturally leads to a [q,p] Pade ap- 
proximant of the pdf. 
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FIG. 1. Example of a normal distribution with exponential 
tails. The sample size is n — 2000. From top to bottom are 
shown: (a) the magnitude \<pk\ of the characteristic function 
(thick line) and the fit resulting from an 7'th order model; (b) 
the information gain (thick line) and the MISE, both showing 
a minimum around p = 7 which is marked by a circle; (c) the 
likelihood L associated with the different pdfs estimated for 
p =1-20; and finally (d) the maximum likelihood estimate of 
the pdf (thick line), an estimate based on a histogram with 
101 equispaced bins (dots) and the true pdf (thin line). 
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FIG. 2. The expectation E[|</>fc|] computed for sets of var- 
ious sizes taken from the same normal distribution. The 
noise-induced bias level goes down as the size increases, pro- 
gressively revealing finer details of the pdf. 
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FIG. 3. The pdfs as calculated for sets of various sizes taken 
from the same bi-normal distribution. The thick line desig- 
nates the maximum likelihood estimate, the thin line the true 
pdf and the dots a histogram estimate obtained from 61 eq- 
uispaced bins. The optimum orders are respectively from top 
to bottom p — 5, p = 6 and p = 11. 
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FIG. 4. Results obtained for an annular distribution; the 
sample size is n — 2000. In (a) The information gain has no 
clear minimum and hence there is no well defined order for the 
model. In (b) the estimated pdfs for p = 1 and p = 2 fail to fit 
the true pdf (thick line). Increasing the order (c) improves the 
fit but also enhances the Gibbs phenomenon. Dots correspond 
to a histogram estimate with equispaced bins. 
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